In statistics as in life, many things become clearer when we consider context. Statisticians'
use of context itself becomes clearer, in fact, when we consider the past century. It was
anathema to them prior to 1906, when Markov [22] proved the weak law of large numbers
applies to chains of dependent events over finite domains (i.e., finite-state Markov processes).¹
He published several papers on the statistics of dependent events and in 1913 gave an
example of dependence in language: he analyzed the first 20 000 characters of Pushkin's Eugene
Onegin and found the likelihood of a vowel was strongly affected by the presence of vowels in
the four preceding positions. Many other examples have been found since, in physics,
chemistry, biology, economics, sociology, psychology, and indeed every branch of the natural and social
sciences. While Markov was developing the idea of Markov processes, another probability 
theorist, Borel, was starting an investigation into examples beyond their scope. Borel [2] 
defined a number to be normal in base b if, in its infinite b-ary representation, every k-tuple
occurs with relative frequency $1/b^k$; he called a number absolutely normal if normal in every
base. Using the Borel-Cantelli Lemma, he showed nearly all numbers are absolutely normal,
although his proof was completely non-constructive. Sierpinski [28] gave the first example
of an absolutely normal number but his construction is still not known to be computable.
The first concrete example of a normal number (although not absolutely normal) was found
by Champernowne [5], who proved 0.123456789101112... (the concatenation of the
positive integers) is normal in decimal. Champernowne published his number while still an
undergraduate at Cambridge, where he would certainly have known of Markov processes from
studying economics with Keynes.² Champernowne was a close friend of Turing, who would
also have known of Markov processes from studying statistics with Hardy,³ and Turing [31]
tried to find a concrete example of an absolutely normal number.⁴ It seems quite possible
the definition of a Markov process influenced Turing's definition of a universal machine (i.e.,
Turing machine)⁵ and his search for an absolutely normal number influenced his study of
incomputable numbers.⁶ After all, Markov processes were good models for the behaviourist
psychology popular in the 1930s, Turing is famous for studying computational models of 
the human mind, and the existence of computable normal numbers shows Turing machines' 
tapes increase their computational power. 



1 Kolmogorov [15] later extended this result to Markov processes over infinite domains.

2 Although later famous as an economist, Keynes wrote his 1909 dissertation, A Treatise on Probability, on statistics; 
published in 1921, the update references included eleven works by Markov. 

3 Like Keynes, Turing wrote his dissertation on statistics, On the Gaussian Error Function. 

4 The question of whether there exist computable absolutely normal numbers was finally settled affirmatively in
2002, by Becher and Figueira [1].

5 Link [20] discusses this point in his article on the transmission of Markov's ideas to the West. 

6 Hodges [13] discusses this point in his review of Copeland's The Essential Turing. 



Shannon [27] made extensive use of Markov processes in his seminal 1948 paper on 
information theory. He proposed that any function H(P) measuring our uncertainty about 
a random variable X that takes on values according to $P = p_1, \ldots, p_\sigma$ should have three
properties: H should be continuous in the $p_i$; if all the $p_i$ are equal, $p_i = 1/\sigma$, then H
should be a monotonic increasing function of $\sigma$; if a choice is broken down into
two successive choices, the original H should be the weighted sum of the individual values
of H. He proved the only function with these properties is his entropy function,
$H(P) = \sum_{i=1}^{\sigma} p_i \log(1/p_i)$.⁷ This axiomatization only elucidated his main results, the Noiseless and
Noisy Coding Theorems, as he did not use it in their proofs. The Noiseless Coding Theorem, 
in its simplest form, says the minimum expected length of a prefix-free code for the value of 
X is in the semi-closed interval [H(P), H(P) + 1); notice it cannot be applied directly when 
the probability distribution is unknown, as it is for natural written languages. For this reason, 
Shannon defined the entropy of a stationary ergodic Markov process to be, essentially, the 
limit as n goes to infinity of 1/n times the entropy of the distribution induced by the process 
over strings of length n; thus, the Noiseless Coding Theorem means that if we draw a string 
s of length n from a stationary ergodic Markov process with entropy h, then 1/n times the 
expected minimum length of a prefix-free code for s approaches h as n goes to infinity. He 
fitted zeroth-, first- and second-order Markov processes to English, gave samples of their 
output and wrote "the resemblance to ordinary English text increases quite noticeably at 
each of the above steps", and "a sufficiently complex stochastic process will give a satisfactory
representation of a discrete source", including "natural written languages such as English, 
German, Chinese." It seems Shannon was unaware of both Champernowne's number and 
another number — the concatenation of the primes — that Copeland and Erdos [9] had 
proven normal in decimal in 1946. The existence of such numbers invalidates the strongest 
interpretation of Shannon's claim: a program generating Champernowne's number (e.g.,
print "0."; for i ≥ 1 {print i}) cannot be represented as a (finite-state)
Markov process.
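
The program in parentheses above can be made concrete. The following is a minimal Python sketch (the names champernowne_digits and champernowne_prefix are ours, not from any cited work) that streams the digits of Champernowne's number. The counter i grows without bound, which is exactly the kind of state a finite-state process cannot keep.

```python
from itertools import count, islice


def champernowne_digits():
    """Yield the decimal digits of Champernowne's number after the point."""
    for i in count(1):          # i = 1, 2, 3, ... without bound
        for ch in str(i):       # emit the decimal digits of i in order
            yield int(ch)


def champernowne_prefix(n):
    """Return "0." followed by the first n digits of Champernowne's number."""
    digits = islice(champernowne_digits(), n)
    return "0." + "".join(str(d) for d in digits)


if __name__ == "__main__":
    print(champernowne_prefix(15))  # 0.123456789101112
```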

Chomsky [7] argued Markov processes are also inadequate models for natural language.
For example, he famously claimed that a probabilistic model cannot determine whether a
novel sentence is grammatical (e.g., "Colorless green ideas sleep furiously." versus "Furiously
sleep ideas green colorless.") and that Markov processes, in particular, cannot recognize
agreement between widely-separated words (as in, e.g., "The man who ... is here." versus
"The man who ... are here.", where the ellipses replace an arbitrarily long verb phrase).⁸ His
solution, proposed in a series of articles and a book between 1956 and 1959, was a hierarchy
of grammar and language types (regular, context-free, context-sensitive and unrestricted)
in which, he proved, the set of languages at each level is a proper superset of the set
of languages in the classes below. This proof should have settled the debate over whether 
natural languages can be viewed as coming from Markov processes; it did in linguistics 

7 The choice of the logarithm's base determines the unit; by convention log means $\log_2$ and the units are bits,
each being our uncertainty about a fair coin flip, or a binary digit chosen uniformly at random.

8 Chomsky's conclusions were sweeping: "the notion 'grammatical in English' cannot be identified in any way with
the notion 'high order of statistical approximation to English'" and "probabilistic models give no particular insight
into some of the basic problems of syntactic structure".
and psychology, following Chomsky's devastating review [8] of Skinner's Verbal Behavior. In
engineering, however, despite initial enthusiasm for Chomsky's ideas (he presented them
at a symposium⁹ at MIT in 1956 organized by the Special Interest Group in Information
Theory, and first published them in the IRE Transactions on Information Theory), the
simplifying assumption that sources are Markovian remains common to this day; in a survey 
of the first fifty years of information theory, Verdu [32] relegates Chomsky's opposition to a 
footnote. 

It may be that engineers prefer Markov processes to Chomsky's grammars because a
number of basic problems about grammars are intractable or incomputable. The most
famous of these is to find the smallest unrestricted grammar that generates precisely a given
string; since unrestricted grammars are Turing-equivalent, this is the same as finding the
string's Kolmogorov complexity [19].¹⁰ Defined independently by Solomonoff [30] in 1964,
Kolmogorov [16] in 1965 and Chaitin [1] in 1969, a string's Kolmogorov complexity is the
length in bits of the shortest program (in a fixed Turing-equivalent language) that outputs
it; according to the Church-Turing thesis, this is the minimum number of bits needed to
express the string. Notice that, unlike Shannon's entropy, Kolmogorov complexity is defined for
individual strings rather than sources and requires no probabilistic assumptions. There are
two important facts, however, that limit its usefulness: although changing the programming
language affects the length of the shortest program by at most an additive constant, that
constant may be quite large (the length in bits of an interpreter for the first language
written in the second language); more importantly, a simple diagonalization shows Kolmogorov
complexity is incomputable and inapproximable.

In 1976 Lempel and Ziv [18] proposed an efficiently-computable complexity metric for
strings, based on the maximum number of distinct non-overlapping substrings they contain.
As an example, they showed de Bruijn sequences are complex with respect to their metric;
a $\sigma$-ary de Bruijn sequence [11] of order k is a $\sigma$-ary string containing every possible k-tuple
exactly once. Investigating their complexity metric led Lempel and Ziv to develop their
well-known LZ77 [33] and LZ78 [34] compression algorithms. They used its properties to prove,
for example, that LZ78's compression ratio is always asymptotically bounded from above
by that of any compression algorithm implementable as a finite-state transducer, regardless
of the source. Nevertheless, Cover and Thomas [10] in their 1991 textbook presented an
analysis of LZ78 that assumes the source to be stationary and ergodic, despite noting in an
earlier section that "It is not immediately obvious whether English is a stationary ergodic
process. Probably not!" Kosaraju and Manzini [17] introduced another complexity metric,
empirical entropy, to re-analyze LZ77 and LZ78 in 1999. The kth-order empirical entropy of
a string is its minimum self-information with respect to a kth-order Markov source, divided
by its length; the self-information of an event with probability p is log(1/p). They considered
families of strings in which the minimum self-information with respect to a Markov source
is sublinear in the length, so the empirical entropy approaches 0 as the length increases;

9 The second day of the symposium included presentations by Newell and Simon, Chomsky, and Miller, who later
called it "the moment of conception of cognitive science" [23].

10 Assuming P ≠ NP, even finding the smallest context-free grammar that generates precisely a given string is
intractable and inapproximable to within a factor of 8569/8568 [6].
Ziv and Lempel's analyses imply LZ77's and LZ78's compression ratios also approach 0,
but Kosaraju and Manzini showed the ratios' convergence is asymptotically slower than the
empirical entropy's, so the ratios are not generally within a constant factor of the empirical
entropy.¹¹
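
For readers unfamiliar with LZ78, the following Python sketch shows the textbook form of its parsing step, in which each phrase is the shortest prefix of the remaining input that is not already a phrase. It is an illustration under that assumption, not Ziv and Lempel's original presentation and not their 1976 complexity metric; the function name lz78_phrases is ours.

```python
def lz78_phrases(s):
    """Split s into LZ78 phrases.

    Each phrase is the shortest prefix of the remaining input that has not
    already appeared as a phrase; only the final phrase may be a repeat.
    """
    seen = set()      # phrases produced so far
    phrases = []
    current = ""      # phrase under construction
    for ch in s:
        current += ch
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:       # input ended in the middle of a known phrase
        phrases.append(current)
    return phrases


if __name__ == "__main__":
    print(lz78_phrases("TORONTO"))  # ['T', 'O', 'R', 'ON', 'TO']
    print(lz78_phrases("aaaaaa"))   # ['a', 'aa', 'aaa']
```

Each phrase can be encoded as a reference to an earlier phrase plus one new character, so a string that parses into few phrases relative to its length compresses well under this scheme.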

The order of an empirical entropy says how much it depends on the string's ordering. For
example, since a 0th-order Markov source is just a probability distribution, the 0th-order
empirical entropy is simply the entropy of the normalized distribution of characters, which
does not depend on the order of the characters at all.¹² We can view the kth-order empirical
entropy $H_k(s)$ of a string s of length n over an alphabet of size $\sigma$ as our expected uncertainty
about a randomly chosen character, given a context of length k; it then follows from the
Noiseless Coding Theorem that $H_k(s) n$ is a lower bound on the number of bits needed to
encode s with any algorithm that uses contexts of length at most k. Let s[i] denote the ith
character of s and consider the following experiment: i is chosen uniformly at random from
{1, ..., n}; if i ≤ k, then we are told s[i]; otherwise, we are told s[i-k] ... s[i-1]. Our
expected uncertainty about the random variable s[i] (its expected entropy) is



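$$
H_k(s) = \begin{cases}
\dfrac{1}{n} \sum_{i=1}^{\sigma} n_i \log \dfrac{n}{n_i} & \text{if } k = 0, \\[1ex]
\dfrac{1}{n} \sum_{|w| = k} |s_w| \cdot H_0(s_w) & \text{if } k \geq 1.
\end{cases}
$$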



Here, $n_i$ is the frequency in s of the ith character in the alphabet, and $s_w$ is the string
obtained by concatenating the characters immediately following occurrences of string w in s;
the length of $s_w$ is the number of occurrences of w in s unless w is a suffix of s, in which
case it is 1 less. Notice $H_{k+1}(s) \leq H_k(s) \leq \log \sigma$ for all k. For example, if s is the string
TORONTO, then

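$$
H_0(s) = \frac{1}{7}\log 7 + \frac{3}{7}\log\frac{7}{3} + \frac{1}{7}\log 7 + \frac{2}{7}\log\frac{7}{2} \approx 1.84,
$$

$$
\begin{aligned}
H_1(s) &= \frac{1}{7}\bigl(H_0(s_N) + 2H_0(s_O) + H_0(s_R) + 2H_0(s_T)\bigr) \\
       &= \frac{1}{7}\bigl(H_0(\mathrm{T}) + 2H_0(\mathrm{RN}) + H_0(\mathrm{O}) + 2H_0(\mathrm{OO})\bigr) \\
       &= 2/7 \approx 0.29
\end{aligned}
$$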

and all higher-order empirical entropies of s are 0. This means if someone chooses a character 
uniformly at random from TORONTO and asks us to guess it, then our uncertainty is 



11 Specifically, Kosaraju and Manzini proved LZ78's compression ratio is not asymptotically bounded within a
constant factor of the kth-order empirical entropy for k ≥ 0, nor is LZ77's for k ≥ 1; the latter is bounded by 8 times
the 0th-order empirical entropy plus lower-order terms.

12 Indeed, the earliest bounds we know of in terms of 0th-order empirical entropy, proven by Munro and Spira [24]
in 1976, were on the complexity of sorting a multiset. The best-known algorithm for this problem, splay-sort, is
based on the 1985 paper in which Sleator and Tarjan [29] introduced splay-trees and analyzed their performance
in terms of 0th-order empirical entropy.
about 1.84 bits. If they tell us the preceding character before we guess, then on average
our uncertainty is about 0.29 bits; if they tell us the preceding two characters, then we 
are certain of the answer. Our ability to precisely quantify a string's empirical entropies 
distinguishes empirical entropy from an earlier notion, stochastic complexity, advocated by 
Rissanen [26] in a series of articles and books starting in 1983. The stochastic complexity (or 
minimum description length) of a string with respect to a class of sources is the minimum 
sum of the self-information of the string with respect to a source and the number of bits 
needed to represent that source. Stochastic complexity has two theoretical advantages over 
empirical entropy — it does not require the sources to be Markovian and it takes into account 
their complexities — but is much more complicated and less concrete, since the method of 
encoding sources is often unspecified. In any case, although the definition of $H_k(s)$ does not
mention the complexity of the kth-order Markov source with respect to which s has minimum
self-information, this source can be specified exactly with an $O(\sigma^{k+1} \log(n/\sigma^{k+1}))$-bit table
listing how often each character follows each k-tuple in s.
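
The definition of the kth-order empirical entropy and the TORONTO example can be checked mechanically. The following Python sketch is one direct reading of the definition (the function names h0 and hk are ours): it groups the characters of s by the k characters preceding them, which builds the strings $s_w$, and then takes the length-weighted sum of their 0th-order entropies.

```python
from collections import Counter, defaultdict
from math import log2


def h0(s):
    """0th-order empirical entropy of s, in bits per character."""
    n = len(s)
    if n == 0:
        return 0.0
    # sum over distinct characters of (n_i / n) * log2(n / n_i)
    return sum(c / n * log2(n / c) for c in Counter(s).values())


def hk(s, k):
    """kth-order empirical entropy of s, in bits per character."""
    if k == 0:
        return h0(s)
    followers = defaultdict(list)   # context w -> characters following w (s_w)
    for i in range(k, len(s)):
        followers[s[i - k:i]].append(s[i])
    # (1/n) * sum over contexts w of |s_w| * H_0(s_w)
    return sum(len(f) * h0(f) for f in followers.values()) / len(s)


if __name__ == "__main__":
    s = "TORONTO"
    print(round(hk(s, 0), 2))  # 1.84
    print(round(hk(s, 1), 2))  # 0.29
    print(hk(s, 2))            # 0.0
```

The first k characters of s have no full context and so contribute nothing to the sum, matching the experiment in which we are simply told s[i] when i ≤ k.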

Two papers established empirical entropy as a popular complexity metric for strings, 
at least in the data structures research community. The first was Manzini's 2001
analysis [21] of the Burrows-Wheeler Transform [3], in which he proved Burrows and Wheeler's
compression algorithm stores s in $8H_k(s)n + (\mu + 2/25)n + \sigma^k(2\sigma\log\sigma + 9)$ bits for every
$k \geq 0$ simultaneously, where $\mu$ is a small implementation-dependent constant.¹³ Other
authors had analyzed the Burrows-Wheeler Transform previously but used "the hypothesis
that the input comes from a finite-order Markov source [which] is not always realistic, and
results based on this assumption are only valid on the average and not in the worst case."
In contrast, Manzini's was a worst-case bound because "the empirical entropy resembles
the entropy defined in the probabilistic setting (for example, when the input comes from a
Markov source) [but] is defined for any string and can be used to measure the performance of
compression algorithms without any assumption on the input."¹⁴ The second was Ferragina
and Manzini's introduction [12] of compressed full-text indices; exploiting the relationship
between the Burrows-Wheeler Transform and suffix array data structures, they showed how
to store s in $5H_k(s)n + o(n)$ bits such that, given a pattern of length $\ell$, we can find all occ
occurrences of that pattern in s in $O(\ell + \mathrm{occ} \log^{1+\epsilon} n)$ time for any $k \geq 0$ and $0 < \epsilon < 1$.
This result attracted a lot of attention and, subsequently, there has been a flood of research
involving empirical entropy.¹⁵ Of course, there are still many open questions and probably
undiscovered applications; regardless of how much context we have considered, in this case 
we cannot guess what will come next. 



13 If arithmetic coding is used in the algorithm's implementation, then $\mu \approx 1/100$.

14 Kaplan, Landau and Verbin [14] recently improved Manzini's bound to $\lambda H_k(s)n + n\log(\zeta(\lambda)) + O(\sigma^{k+1}\log\sigma)$ bits,
where $\lambda$ is any constant greater than 1 and $\zeta$ is the Riemann zeta function.

15 Navarro and Mäkinen [25] recently surveyed dozens of papers on compressed full-text indices.